Comparing many distributions

The extended movies dataset

import altair as alt
from vega_datasets import data

movies_extended = data.movies().dropna(subset=['Major_Genre'])
movies_extended
Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes
1 First Love, Last Rites 10876.0 10876.0 NaN 300000.0 Aug 07 1998 R NaN Strand None Drama None None NaN 6.9 207.0
2 I Married a Strange Person 203134.0 203134.0 NaN 250000.0 Aug 28 1998 None NaN Lionsgate None Comedy None None NaN 6.8 865.0
3 Let's Talk About Sex 373615.0 373615.0 NaN 300000.0 Sep 11 1998 None NaN Fine Line None Comedy None None 13.0 NaN NaN
4 Slam 1009819.0 1087521.0 NaN 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction None 62.0 3.4 165.0
7 Foolish 6026908.0 6026908.0 NaN 1600000.0 Apr 09 1999 R NaN Artisan Original Screenplay Comedy Contemporary Fiction None NaN 3.8 353.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3196 Zack and Miri Make a Porno 31452765.0 36851125.0 21240321.0 24000000.0 Oct 31 2008 R 101.0 Weinstein Co. Original Screenplay Comedy Contemporary Fiction Kevin Smith 65.0 7.0 55687.0
3197 Zodiac 33080084.0 83080084.0 20983030.0 85000000.0 Mar 02 2007 R 157.0 Paramount Pictures Based on Book/Short Story Thriller/Suspense Dramatization David Fincher 89.0 NaN NaN
3198 Zoom 11989328.0 12506188.0 6679409.0 35000000.0 Aug 11 2006 PG NaN Sony Pictures Based on Comic/Graphic Novel Adventure Super Hero Peter Hewitt 3.0 3.4 7424.0
3199 The Legend of Zorro 45575336.0 141475336.0 NaN 80000000.0 Oct 28 2005 PG 129.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 26.0 5.7 21161.0
3200 The Mask of Zorro 93828745.0 233700000.0 NaN 65000000.0 Jul 17 1998 PG-13 136.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 82.0 6.7 4789.0

2926 rows × 16 columns

Many distributions can’t be effectively compared with histograms

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('Worldwide_Gross', bin=alt.Bin(maxbins=30)),
    alt.Y('count()'),
    alt.Color('Major_Genre'))

Many distributions can’t be effectively compared with densities either

(alt.Chart(movies_extended).mark_area().transform_density(
    'Worldwide_Gross',
    groupby=['Major_Genre'],
    as_=['Worldwide_Gross', 'density'])
 .encode(
    alt.X('Worldwide_Gross'),
    alt.Y('density:Q'),
    alt.Color('Major_Genre')))

Bar charts are effective for comparing a single value per group but hide variation

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('mean(Worldwide_Gross)'),
    alt.Y('Major_Genre'))

Showing a single value can lead to incorrect conclusions

Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm

Barplot Hiding Points
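
As a small illustration of how a single summary value can mislead, the sketch below builds two made-up groups (the `toy` table and its `group` and `value` columns are hypothetical, not part of the movies data) that share the same mean but have very different spreads. A bar chart of the means would draw both bars at the same length.

```python
import pandas as pd

# Hypothetical toy data: two groups with identical means but different spreads.
toy = pd.DataFrame({
    'group': ['a'] * 5 + ['b'] * 5,
    'value': [48, 49, 50, 51, 52,    # tightly clustered around 50
              0, 10, 50, 90, 100],   # widely spread around 50
})

# The mean is all a bar chart would show; std, min and max reveal the difference.
summary = toy.groupby('group')['value'].agg(['mean', 'std', 'min', 'max'])
print(summary)
```

Both groups have a mean of 50, so only the extra columns distinguish them.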

Showing individual observations gives a richer representation than bar charts

alt.Chart(movies_extended).mark_tick().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre'))

Tooltips are helpful for answering questions about specific observations

alt.Chart(movies_extended).mark_tick().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre'),
    alt.Tooltip('Title:N'))

Heatmaps can compare multiple distributions without saturation

(alt.Chart(movies_extended).mark_rect().encode(
    alt.X('Worldwide_Gross', bin=alt.Bin(maxbins=100)),
    alt.Y('Major_Genre'),
    alt.Color('count()')))
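
To make the aggregation behind the heatmap concrete, here is a rough pandas sketch of the same idea: bin the quantitative variable, then count observations per group and bin. The `toy` table, its column names and the bin edges are made up for illustration; Altair picks its own bin edges from `maxbins`.

```python
import pandas as pd

# Hypothetical toy data standing in for the movies table.
toy = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', 'Comedy'],
    'gross': [5, 15, 18, 2, 8, 25],
})

# Bin the quantitative variable into fixed-width intervals ...
toy['gross_bin'] = pd.cut(toy['gross'], bins=[0, 10, 20, 30])

# ... then count observations per (genre, bin).
# Each count corresponds to the color of one heatmap cell.
counts = pd.crosstab(toy['genre'], toy['gross_bin'])
print(counts)
```

Because each cell encodes a count with color rather than with stacked or overlapping marks, adding more groups adds more rows instead of more overplotting.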

Boxplots show several key statistics from a distribution

Jhguch at en.wikipedia via Wikimedia Commons
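
The statistics a boxplot draws can be computed by hand. The sketch below uses a made-up sample (`values` is hypothetical) and assumes the common Tukey convention, where whiskers extend to the most extreme observations within 1.5 times the interquartile range of the box. Libraries interpolate quartiles in slightly different ways, so the exact numbers in a rendered chart may differ a little.

```python
import pandas as pd

# Hypothetical toy sample with one unusually large value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# The box spans the first to third quartile; the line inside it is the median.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Tukey-style whiskers: the most extreme observations within 1.5 * IQR of the box.
lower_whisker = values[values >= q1 - 1.5 * iqr].min()
upper_whisker = values[values <= q3 + 1.5 * iqr].max()

# Anything beyond the whiskers is drawn as an individual outlier point.
outliers = values[(values < lower_whisker) | (values > upper_whisker)]
print(q1, median, q3, lower_whisker, upper_whisker, outliers.tolist())
```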




Boxplots can effectively compare multiple distributions

bar = alt.Chart(movies_extended).mark_bar().encode(
    alt.X('mean(Worldwide_Gross)'),
    alt.Y('Major_Genre'))

box = alt.Chart(movies_extended).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre'))

box | bar

Sorted boxplots are more effective for comparing similar distributions

genre_order = movies_extended.groupby(
    'Major_Genre')['Worldwide_Gross'].median().sort_values().index.tolist()
alt.Chart(movies_extended).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre', sort=genre_order))

Zooming in facilitates comparison of small differences

filtered_movies = movies_extended[movies_extended['Worldwide_Gross'] < 1_500_000_000]
alt.Chart(filtered_movies).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre', sort=genre_order))
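
One thing to keep in mind when zooming by filtering rows: the boxplot statistics are recomputed from the remaining observations, so the boxes can shift slightly. The sketch below shows the effect on a made-up sample (`values` and the cutoff of 50 are hypothetical).

```python
import pandas as pd

# Hypothetical toy sample with one extreme value.
values = pd.Series([1, 2, 3, 4, 100])

# Median of all observations.
median_all = values.median()

# Median after dropping the extreme value, as a row filter would.
median_filtered = values[values < 50].median()

print(median_all, median_filtered)
```

When only a handful of extreme observations are removed from a large group, as with the threshold above, the shift is usually small.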

Boxplots can be scaled by the number of observations

alt.Chart(movies_extended).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre', sort=genre_order),
    alt.Size('count()'))

Boxplots are not able to accurately represent data with multiple peaks

From Autodesk research

Point Box Violin
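
A small numeric sketch of the problem, using a made-up two-peaked sample (`bimodal` is hypothetical): the quartiles look unremarkable, yet the median lands in a region that contains no observations at all.

```python
import pandas as pd

# Hypothetical bimodal sample: one cluster near 2 and another near 10.
bimodal = pd.Series([1, 1, 2, 2, 2, 3, 3, 9, 9, 10, 10, 10, 11, 11])

# These three numbers are what the box of a boxplot would show.
q1, median, q3 = bimodal.quantile([0.25, 0.5, 0.75])
print(q1, median, q3)

# Count the observations within 2 units of the median: there are none,
# because the median falls in the gap between the two clusters.
n_near_median = bimodal.between(median - 2, median + 2).sum()
print(n_near_median)
```

A plot that shows the shape of the distribution, such as individual points or a violin, makes the two clusters visible where the box alone suggests a single central mass.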

Let’s apply what we learned!